Aim: Propose a model for forest biomass of small-leaved lime trees.

About the data
  • A series of studies sampled the forest biomass in Eurasia.
  • We use part of that data for small-leaved lime trees (Tilia cordata).
  • The data contains the variables:
    • Foliage: the foliage biomass (in kg)
    • DBH: the tree diameter at breast height (in cm)
    • Age: the age of the tree (in years)
    • Origin: the origin of the tree (Coppice, Natural, or Planted)

Load the data

  • Is this data experimental or observational?

Explore the data

  • Let’s start by looking at the pairwise plots of the data.
  • Before building the model, let’s take some domain context into account. Foliage mostly grows on the outer canopy, which can be crudely approximated as a sphere. The surface area of a sphere is \(4\pi r^2\), where \(r\) is the radius. If we further assume that the crown radius scales with the trunk diameter, we could consider that \(\texttt{Foliage} \propto 4 \pi (\texttt{DBH}/2)^2 = \pi \texttt{DBH}^2\). Taking logs of both sides gives \[\log(\texttt{Foliage}) \approx \text{constant} + 2\log(\texttt{DBH}).\] This suggests log-transforming both Foliage and DBH. What does the data show?
Answer

The scatter plot shows an approximately linear relationship between log(Foliage) and log(DBH). The coefficient of log(DBH) is close to 2, which is consistent with the sphere approximation.
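To see why a power law appears as a straight line with slope 2 on the log-log scale, here is a minimal sketch using only the Python standard library. The tree measurements are hypothetical and noise-free (the constant 0.01 is an arbitrary assumption, not an estimate from the lime-tree data), and `fit_line` is a small helper written for this illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical, noise-free trees that follow Foliage = 0.01 * DBH^2 exactly.
dbh = [5.0, 10.0, 20.0, 40.0]
foliage = [0.01 * d ** 2 for d in dbh]

# Regress log(Foliage) on log(DBH).
intercept, slope = fit_line([math.log(d) for d in dbh],
                            [math.log(f) for f in foliage])
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Because the synthetic data follow the power law exactly, the fitted slope recovers the exponent 2 and the intercept recovers log(0.01). With real data the slope would only be approximately 2.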

Model the data

  • Let’s consider the systematic component to be log(Foliage) ~ log(DBH) * Age * Origin for now. Are the two models shown below the same?
  • What about the two models shown below? Are they the same? Why or why not?
Answer

The models are not the same. The first models the expected value of the log of the response, \(E(\log(Y))\), while the second models the log of the expected response, \(\log(E(Y))\). In general \(E(\log(Y)) \neq \log(E(Y))\); by Jensen's inequality, \(E(\log(Y)) \leq \log(E(Y))\). This is a subtle but important distinction.
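A small numerical check makes the distinction concrete. The three response values below are hypothetical and chosen only to make the gap easy to see. The sketch compares the sample analogue of \(E(\log(Y))\) with that of \(\log(E(Y))\).

```python
import math

# Hypothetical positive responses, chosen only to make the gap easy to see.
y = [1.0, 10.0, 100.0]

mean_of_logs = sum(math.log(v) for v in y) / len(y)   # sample analogue of E(log(Y))
log_of_mean = math.log(sum(y) / len(y))               # sample analogue of log(E(Y))

print(f"mean of logs = {mean_of_logs:.3f}")
print(f"log of mean  = {log_of_mean:.3f}")

# Back on the original scale the two quantities are different summaries:
print(f"exp(mean of logs) = {math.exp(mean_of_logs):.1f}  (geometric mean)")
print(f"arithmetic mean   = {sum(y) / len(y):.1f}")
```

Back-transforming the mean of the logs gives the geometric mean (10 here), whereas the log-link model targets the arithmetic mean (37 here). A model for \(E(\log(Y))\) therefore does not directly predict the mean response on the original scale.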

  • Let’s break the data into small groups by Age and Origin and look at the relationship between the mean and the variance of Foliage within each group.
Answer

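We have \(\log(\text{group variance}) \approx 2\log(\text{group mean})\), possibly up to an additive constant. This means that the group variance is approximately proportional to \(\text{group mean}^2\). Recall that the Gamma distribution has variance function \(V(\mu) = \mu^2\), so that \(\text{Var}(Y) = \phi\mu^2\) with dispersion parameter \(\phi\). This suggests that the Gamma distribution could be a good choice for modelling the data.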
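The same diagnostic can be tried on simulated data where the answer is known. The sketch below draws Gamma samples for several hypothetical group means with an assumed common shape parameter, then regresses log(group variance) on log(group mean). All numbers are illustrative assumptions, not values from the lime-tree data.

```python
import math
import random
import statistics

random.seed(1)

shape = 4.0                                       # assumed common Gamma shape (dispersion = 1 / shape)
group_means = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0]   # hypothetical group means

log_means, log_vars = [], []
for mu in group_means:
    # random.gammavariate(alpha, beta) has mean alpha * beta, so beta = mu / shape.
    sample = [random.gammavariate(shape, mu / shape) for _ in range(2000)]
    log_means.append(math.log(statistics.fmean(sample)))
    log_vars.append(math.log(statistics.variance(sample)))

# Least-squares slope and intercept of log(variance) on log(mean).
x_bar = statistics.fmean(log_means)
y_bar = statistics.fmean(log_vars)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(log_means, log_vars))
         / sum((x - x_bar) ** 2 for x in log_means))
intercept = y_bar - slope * x_bar
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For Gamma data with a common shape, the slope should be near 2 and the intercept near \(\log(\phi) = -\log(\text{shape})\). A slope near 1 would instead point towards a Poisson-like variance function.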

  • Let’s consider the systematic component to be Foliage ~ log(DBH) * Age * Origin and fit a Gamma regression with log link. Do you think Age should be included in the model?
  • Update the model without Age. Should Origin be included in the model?
  • Check some model diagnostics to assess your selected model.
Answer
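As an illustration of what fitting a Gamma regression with a log link involves (and not the analysis of the lime-tree data itself), here is a minimal sketch of iteratively reweighted least squares for a single predictor. It uses only the Python standard library. The function `fit_gamma_log_link` and the data are hypothetical. In practice one would use the GLM routine of a statistics package, which also handles several predictors, interactions, and standard errors.

```python
import math

def fit_gamma_log_link(x, y, n_iter=25):
    """Fit E(y) = exp(b0 + b1 * x) by iteratively reweighted least squares.

    For the Gamma family with a log link the IRLS weights are constant,
    so each step is an ordinary least-squares fit of the working response
    z = eta + (y - mu) / mu on x.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    eta = [math.log(yi) for yi in y]      # start from the observed responses
    b0 = b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        z_bar = sum(z) / n
        b1 = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
        b0 = z_bar - b1 * x_bar
        eta = [b0 + b1 * xi for xi in x]  # updated linear predictor
    return b0, b1

# Hypothetical trees: Foliage = 0.01 * DBH^2 with small multiplicative deviations.
dbh = [5.0, 10.0, 15.0, 20.0, 30.0, 40.0]
factors = [1.10, 0.90, 1.05, 0.95, 1.08, 0.93]
foliage = [0.01 * d ** 2 * f for d, f in zip(dbh, factors)]
log_dbh = [math.log(d) for d in dbh]

b0, b1 = fit_gamma_log_link(log_dbh, foliage)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The fitted coefficient of log(DBH) should be close to the exponent 2 used to generate the data. Questions such as whether Age or Origin belong in the model would then be addressed by comparing nested fits, for example through their deviances.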